- Review a simple linear regression using Bayesian methods
- How to fit a multiple linear regression (Stan) using Bayesian methods
- Cases of significant/insignificant, correlated and collinear predictors
- Shrinkage
- Bayesian model with or without shrinkage, ridge regression and lasso regression: an example of SAT scores
In a previous post we saw how to fit a simple linear regression using both the Frequentist approach and Bayesian methods. Now we move on to multiple linear regression.
Generate data for multiple linear regression with 2 independent significant predictors.
# generate data
set.seed(03182021)
Ntotal <- 500
x <- cbind(rnorm(Ntotal, mean = 20, sd = 4),
           rnorm(Ntotal, mean = 10, sd = 6))
Nx <- ncol(x)
y <- 4 + 1.1*x[,1] + 3*x[,2] + rnorm(Ntotal, mean = 0, sd = 1)
Create a data list.
dataListRegression <- list(Ntotal = Ntotal, y = y, x = as.matrix(x), Nx = Nx)
Here’s a model in the Frequentist method:
summary(lm(y~x))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.83480 -0.63696 0.00131 0.69896 2.48803
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.565419 0.230456 19.81 <2e-16 ***
x1 1.074206 0.010945 98.14 <2e-16 ***
x2 2.995385 0.007954 376.57 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9759 on 497 degrees of freedom
Multiple R-squared: 0.9969, Adjusted R-squared: 0.9968
F-statistic: 7.895e+04 on 2 and 497 DF, p-value: < 2.2e-16
The diagram of the model shows hierarchical structure with normal priors for intercept \(\beta_0\) and slopes \(\beta_i\), \(i=1,2\).
Description of the model:
\[y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i \\ y_i \sim t(\mu_i, \sigma, \nu) \\ \\ \text{where mean, scale and degree of freedom, respectively, would be} \\ \mu_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} \\ \beta_0 \sim N(M_0, S_0) \ ; \ \beta_j \sim N(M_1, S_1) \\ \sigma \sim Unif(L, H) \\ \nu = \gamma^{\prime} + 1, \ \text{where} \ \gamma^{\prime} \sim exp(\lambda)\]
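As a quick sanity check on the \(\nu\) prior (a sketch added here, not part of the original analysis; the seed and number of draws are arbitrary): drawing \(\gamma^{\prime} \sim exp(1/30)\) and setting \(\nu = \gamma^{\prime} + 1\) keeps \(\nu\) above 1 with prior mean about 31, so both near-normal and heavy-tailed likelihoods are plausible a priori.
# prior draws of nu = gamma' + 1 with gamma' ~ exponential(rate = 1/30)
set.seed(1)                         # arbitrary seed for this illustration
nuDraws <- 1 + rexp(10000, rate = 1/30)
mean(nuDraws)                       # prior mean of nu is about 31
mean(nuDraws < 30)                  # sizable prior mass on heavy-tailed (small nu) values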
Now we come back to the problem. Here we write the model string:
modelString <- "
data {
int<lower=1> Ntotal;
int<lower=1> Nx;
vector[Ntotal] y;
matrix[Ntotal, Nx] x;
}
transformed data {
real meanY;
real sdY;
vector[Ntotal] zy; // normalized
vector[Nx] meanX;
vector[Nx] sdX;
matrix[Ntotal, Nx] zx; // normalized
meanY = mean(y);
sdY = sd(y);
zy = (y - meanY) / sdY;
for (j in 1:Nx) {
meanX[j] = mean(x[,j]);
sdX[j] = sd(x[,j]);
for ( i in 1:Ntotal ) {
zx[i,j] = (x[i,j] - meanX[j]) / sdX[j];
}
}
}
parameters {
real zbeta0;
vector[Nx] zbeta;
real<lower=0> nu;
real<lower=0> zsigma;
}
transformed parameters{
vector[Ntotal] zy_hat;
zy_hat = zbeta0 + zx * zbeta;
}
model {
zbeta0 ~ normal(0, 2);
zbeta ~ normal(0, 2);
nu ~ exponential(1/30.0);
zsigma ~ uniform(1.0E-5 , 1.0E+1);
zy ~ student_t(1+nu, zy_hat, zsigma);
}
generated quantities {
// Transform to original scale:
real beta0;
vector[Nx] beta;
real sigma;
// .* and ./ are element-wise product and divide
beta0 = zbeta0*sdY + meanY - sdY * sum( zbeta .* meanX ./ sdX );
beta = sdY * ( zbeta ./ sdX );
sigma = zsigma * sdY;
} "
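To see where the formulas in the generated quantities block come from, substitute the standardization \(z_y = (y - \bar{y})/s_y\) and \(z_{x_j} = (x_j - \bar{x}_j)/s_{x_j}\) into the regression on the standardized scale (a short derivation in the notation of the code, with \(\zeta\) standing for zbeta):
\[\hat{y} = \bar{y} + s_y\left(\zeta_0 + \sum_j \zeta_j \frac{x_j - \bar{x}_j}{s_{x_j}}\right) = \underbrace{\left(\zeta_0 s_y + \bar{y} - s_y \sum_j \zeta_j \frac{\bar{x}_j}{s_{x_j}}\right)}_{\beta_0} + \sum_j \underbrace{\frac{s_y\,\zeta_j}{s_{x_j}}}_{\beta_j} x_j, \qquad \sigma = s_y\, \sigma_z\]
where \(\bar{y}, s_y, \bar{x}_j, s_{x_j}\) correspond to meanY, sdY, meanX, sdX, and \(\sigma_z\) to zsigma.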
RobustMultipleRegressionDso <- stan_model(model_code = modelString)
If a saved DSO is available, load it instead of recompiling, then run the chains.
# saveRDS(RobustMultipleRegressionDso, file="data/DSORobustMultRegr.Rds")
RobustMultipleRegressionDso <- readRDS("data/DSORobustMultRegr.Rds")
Fit the model.
fit1 <- sampling(RobustMultipleRegressionDso,
                 data = dataListRegression,
                 pars = c('beta0', 'beta', 'nu', 'sigma'),
                 iter = 5000, chains = 2, cores = 2)
stan_ac(fit1)
stan_trace(fit1)
Look at the results.
summary(fit1)$summary[,c(1,3,4,5,8,9)] # mean, sd, 2.5%, 25%, 97.5%, n_eff
mean sd 2.5% 25%
beta0 4.5780310 0.231357219 4.1304172 4.422841
beta[1] 1.0738282 0.011006048 1.0521979 1.066428
beta[2] 2.9951527 0.008047017 2.9797451 2.989759
nu 61.7987150 36.009712586 17.0095629 36.082778
sigma 0.9611543 0.032896641 0.8978996 0.938909
lp__ 1014.2033841 1.585209650 1010.1168812 1013.426674
97.5% n_eff
beta0 5.025614 7033.830
beta[1] 1.095054 6561.254
beta[2] 3.011046 6871.816
nu 151.390448 4557.868
sigma 1.027262 4328.823
lp__ 1016.293483 2600.401
pairs(fit1,pars=c("beta0","beta[1]","beta[2]"))
plot(fit1,pars="nu")
plot(fit1,pars="sigma")
plot(fit1,pars="beta0")
plot(fit1,pars="beta[1]")
plot(fit1,pars="beta[2]")
Analyze fitted model using shinystan
launch_shinystan(fit1)
Conclusions:
A Stan program has three required "blocks":
- data block: where you declare the data types, their dimensions, any restrictions (i.e. upper = or lower =, which act as checks for Stan), and their names. Any names you give to your Stan program will also be the names used in other blocks.
- parameters block: where you indicate the parameters you want to model, their dimensions, restrictions, and names. For a linear regression, we will want to model the intercept, any slopes, and the standard deviation of the errors around the regression line.
- model block: where you include any sampling statements, including the "likelihood" (model) you are using. The model block is where you indicate any prior distributions you want to include for your parameters. If no prior is defined, Stan uses default priors with the specification uniform(-infinity, +infinity). You can restrict priors using upper or lower when declaring the parameters (i.e. <lower = 0> to make sure a parameter is positive). You can find more information about prior specification here. Sampling is indicated by the ~ symbol, and Stan already includes many common distributions as vectorized functions. You can check out the manual for a comprehensive list and more information on the optional blocks you could include in your Stan model.
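For reference, here is a minimal sketch of a Stan program that uses only the three required blocks (a toy simple regression relying on Stan's default flat priors; it is not one of the models fitted in this post):
exampleModelString <- "
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  // no explicit priors: Stan uses flat priors over the declared ranges
  y ~ normal(alpha + beta * x, sigma);
}
"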
There are also four optional blocks:
- functions
- transformed data: allows for preprocessing of the data
- transformed parameters: allows for parameter processing before the posterior is computed
- generated quantities: allows for postprocessing of quantities derived from the parameters
Objects declared in the transformed parameters block are used in the model block (although in hierarchical models the line between the prior and the likelihood can be drawn in multiple ways); if an object is not used in the model block, it should usually be declared in the generated quantities block of a Stan program.
The purpose of declaring such things in the transformed parameters block rather than the parameters block is often to obtain more efficient sampling from the posterior distribution. If there is a posterior PDF \(f(\theta \mid \text{data})\), then for any bijective transformation from \(\alpha\) to \(\theta\), the posterior PDF of \(\alpha\) is simply \(f(\theta(\alpha) \mid \text{data})\,|J|\), where \(|J|\) is the absolute value of the determinant of the Jacobian matrix of the transformation from \(\alpha\) to \(\theta\). Thus, you can make the same inferences about (functions of) \(\theta\) either by drawing from the posterior whose PDF is \(f(\theta \mid \text{data})\), where \(\theta\) are the parameters, or from the posterior whose PDF is \(f(\theta(\alpha) \mid \text{data})\,|J|\), where \(\alpha\) are the parameters and \(\theta\) are transformed parameters. Since the posterior inferences about (functions of) \(\theta\) are the same, you are free to choose a transformation that enhances the efficiency of the sampling by making \(\alpha\) less correlated, unit scaled, more Gaussian, etc. than \(\theta\).
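As a one-dimensional illustration of this change of variables: if \(\theta = \exp(\alpha)\), then
\[f_{\alpha}(\alpha \mid \text{data}) = f_{\theta}\!\big(\exp(\alpha) \mid \text{data}\big)\,\exp(\alpha),\]
since \(|J| = d\theta/d\alpha = \exp(\alpha)\); sampling in \(\alpha\) removes the positivity constraint on \(\theta\) while leaving posterior inferences about \(\theta\) unchanged.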
\(\Rightarrow\) Transformation to improve fit.
Comments are indicated by // in Stan.
The
write("model code", "file_name")
bit allows us to write the
Stan model in our R script and output the file to the working directory
(or you can set a different file path)
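For example (the file name here is hypothetical; write() is base R and stan_model() is from rstan):
# write the Stan program defined above to a file in the working directory
write(modelString, "robustMultipleRegression.stan")
# it could then be compiled from the file instead of from the string:
# RobustMultipleRegressionDso <- stan_model(file = "robustMultipleRegression.stan")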
Regression.Data <- as.matrix(read.csv("data/DtSim4RegANOVA.csv", header=TRUE, sep=","))
tail(Regression.Data)
Output Input1 Input2
[495,] 2.4054442 0.9276934 0.07278244
[496,] 1.8663026 -0.3678520 1.51715986
[497,] 1.3590146 0.5369795 0.96209003
[498,] 3.1836007 1.0171332 -0.56660564
[499,] 2.3615061 1.1637966 0.07815352
[500,] 0.8483407 1.1775607 1.59720356
Prepare the data for Stan.
Ntotal <- nrow(Regression.Data)
x <- Regression.Data[ ,2:3]
tail(x)
Input1 Input2
[495,] 0.9276934 0.07278244
[496,] -0.3678520 1.51715986
[497,] 0.5369795 0.96209003
[498,] 1.0171332 -0.56660564
[499,] 1.1637966 0.07815352
[500,] 1.1775607 1.59720356
Nx <- ncol(x)
y <- Regression.Data[ ,1]
dataListInsig <- list(Ntotal = Ntotal,
                      y = y,
                      x = as.matrix(x),
                      Nx = Nx)
Run MCMC using the same DSO.
fit2 <- sampling(RobustMultipleRegressionDso, data = dataListInsig,
                 pars = c('beta0', 'beta', 'nu', 'sigma'),
                 iter = 5000, chains = 2, cores = 2)
launch_shinystan(fit2)
Analyze the results.
summary(fit2)$summary[,c(1,3,4,8,9)] #mean, sd, 2.5%, 97.5%, n_eff
mean sd 2.5% 97.5%
beta0 1.217597e+00 0.05182680 1.11612951 1.31636194
beta[1] 7.999726e-01 0.02812159 0.74453626 0.85437808
beta[2] 9.416795e-03 0.02736786 -0.04397004 0.06323134
nu 5.276307e+01 34.06776500 13.74259262 142.96095320
sigma 5.979875e-01 0.02144590 0.55558445 0.64151495
lp__ -1.817471e+02 1.53474481 -185.40442023 -179.67956203
n_eff
beta0 5838.281
beta[1] 6158.199
beta[2] 5787.610
nu 4567.018
sigma 4814.577
lp__ 2650.410
pairs(fit2,pars=c("beta0","beta[1]","beta[2]"))
plot(fit2,pars="nu")
plot(fit2,pars="sigma")
plot(fit2,pars="beta0")
plot(fit2,pars="beta[1]")
plot(fit2,pars="beta[2]")
We see that parameter \(\beta_2\) is
not significant.
However, there is no strong correlation or redundancy between the
predictors.
Compare with the output of the linear model:
pairs(Regression.Data)
summary(lm(Output~., data=as.data.frame(Regression.Data)))
Call:
lm(formula = Output ~ ., data = as.data.frame(Regression.Data))
Residuals:
Min 1Q Median 3Q Max
-1.76631 -0.39358 -0.01411 0.40432 1.91861
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.21480 0.05179 23.458 <2e-16 ***
Input1 0.80116 0.02819 28.423 <2e-16 ***
Input2 0.00970 0.02787 0.348 0.728
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6107 on 497 degrees of freedom
Multiple R-squared: 0.6204, Adjusted R-squared: 0.6188
F-statistic: 406.1 on 2 and 497 DF, p-value: < 2.2e-16
Now simulate data with two strongly correlated predictors.
set.seed(03192021)
Ntotal <- 500
x1 <- rnorm(Ntotal, mean = 20, sd = 4)
x2 <- 1 - 1.5*x1 + rnorm(Ntotal, mean = 0, sd = .1)
x <- cbind(x1, x2)
plot(x)
Nx <- ncol(x)
y <- 4 + .2*x[,1] + 3*x[,2] + rnorm(Ntotal, mean = 0, sd = 1)
plot(x[,1], y)
plot(x[,2], y)
fitlm <- lm(y ~ x[,1] + x[,2])
summary(fitlm)
Call:
lm(formula = y ~ x[, 1] + x[, 2])
Residuals:
Min 1Q Median 3Q Max
-2.7765 -0.7585 0.0172 0.6577 3.1721
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.5619 0.4878 9.353 < 2e-16 ***
x[, 1] -0.5468 0.6707 -0.815 0.415
x[, 2] 2.5034 0.4479 5.589 3.76e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9988 on 497 degrees of freedom
Multiple R-squared: 0.9962, Adjusted R-squared: 0.9962
F-statistic: 6.53e+04 on 2 and 497 DF, p-value: < 2.2e-16
drop1(fitlm)
Single term deletions
Model:
y ~ x[, 1] + x[, 2]
Df Sum of Sq RSS AIC
<none> 495.85 1.8338
x[, 1] 1 0.6632 496.51 0.5021
x[, 2] 1 31.1702 527.02 30.3165
dataListShrink2 <- list(Ntotal = Ntotal, y = y, x = as.matrix(x), Nx = Nx)
Note that the actual coefficient for x[,1] is \(+0.2\), but the slope on the plot plot(x[,1],y) is negative.
Also note that the estimated model coefficients differ from the actual ones because of the correlation:
cbind(actual=c(4,.2,3),estimated=fitlm$coefficients)
actual estimated
(Intercept) 4.0 4.5619222
x[, 1] 0.2 -0.5468278
x[, 2] 3.0 2.5034438
Run the chains and analyze the results.
tStart <- proc.time()
fit3 <- sampling(RobustMultipleRegressionDso,
                 data = dataListShrink2,
                 pars = c('beta0', 'beta', 'nu', 'sigma'),
                 iter = 5000, chains = 2, cores = 2)
tEnd <- proc.time()
tEnd - tStart
user system elapsed
0.51 0.39 92.09
How long did it take to run this MCMC? Why so long?
Monte Carlo methods assume that the samples are independent. This is usually not the case for sequential draws in Markov Chain Monte Carlo (MCMC) sampling, motivating the use of thinning. In thinning, we keep every \(T^{th}\) sample and throw away the others. For some types of MCMC (notably, Gibbs sampling), highly correlated variables result in highly correlated samples (i.e., definitely not independent), requiring a large \(T\) to compensate. Large values of \(T\) mean a large computational cost for each effective sample (i.e., for each sample that is kept).
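If autocorrelation were a problem here, thinning could be requested directly through the thin argument of sampling(); a sketch with an arbitrary thinning interval of 5 (not run in the original analysis):
# keep every 5th post-warmup draw to reduce autocorrelation of the stored samples
fit3_thin <- sampling(RobustMultipleRegressionDso,
                      data = dataListShrink2,
                      pars = c('beta0', 'beta', 'nu', 'sigma'),
                      iter = 5000, chains = 2, cores = 2, thin = 5)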
Check convergence in shiny.
launch_shinystan(fit3)
stan_dens(fit3)
stan_ac(fit3, separate_chains = T)
summary(fit3)$summary[,c(1,3,4,8,9)]
mean sd 2.5% 97.5% n_eff
beta0 4.5857595 0.48813790 3.6314232 5.5544965 2114.939
beta[1] -0.5967437 0.67710628 -1.9314954 0.7388229 1713.891
beta[2] 2.4699613 0.45229741 1.5765319 3.3619226 1712.967
nu 56.0426640 33.61895666 15.8382163 143.5659182 2718.137
sigma 0.9816801 0.03355264 0.9166907 1.0483195 2731.866
lp__ 967.3463697 1.56455325 963.5132940 969.3727437 1911.020
pairs(fit3,pars=c("beta0","beta[1]","beta[2]"))
plot(fit3,pars="nu")
plot(fit3,pars="sigma")
plot(fit3,pars="beta0")
plot(fit3,pars="beta[1]")
plot(fit3,pars="beta[2]")
General signs of collinear predictors (a quick check of the first sign follows the list):
- High correlation between slopes (compensating sign)
- Wide posterior distributions for slopes
- Increased autocorrelation for slopes
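A quick way to check the first sign on fit3 (a sketch using rstan::extract(); the exact value will vary from run to run):
# posterior draws of the two slopes: a strong negative correlation between them
# is the compensating-sign behavior typical of correlated predictors
betaDraws <- rstan::extract(fit3, pars = "beta")$beta  # matrix: draws x 2
cor(betaDraws[, 1], betaDraws[, 2])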
pairs(cbind(y,x1,x2))
cbind(actual=c(4,.2,3),estimatedLm=fitlm$coefficients,estimatedBayes=summary(fit3)$summary[1:3,1])
actual estimatedLm estimatedBayes
(Intercept) 4.0 4.5619222 4.5857595
x[, 1] 0.2 -0.5468278 -0.5967437
x[, 2] 3.0 2.5034438 2.4699613
The linear model shows the same information as the Bayesian one.
When predictors have strong collinearity, the linear model may stop working.
Simulate the same model as in the previous section, but make predictors
collinear.
set.seed(03192021)
Ntotal <- 500
x1 <- rnorm(Ntotal, mean = 20, sd = 4)
x2 <- 1 - 1.5*x1 + rnorm(Ntotal, mean = 0, sd = .000001) # sd close to 0
x <- cbind(x1, x2)
plot(x)
Nx <- ncol(x)
y <- 4 + .2*x[,1] + 3*x[,2] + rnorm(Ntotal, mean = 0, sd = 1)
plot(x[,1], y)
plot(x[,2], y)
dataListShrink2c <- list(Ntotal = Ntotal, y = y, x = as.matrix(x), Nx = Nx)
(lmFit <- lm(y ~ x1 + x2))
Call:
lm(formula = y ~ x1 + x2)
Coefficients:
(Intercept) x1 x2
7.094 -4.303 NA
summary(lmFit)
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-2.7626 -0.7636 0.0244 0.6410 3.1747
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.09394 0.24447 29.02 <2e-16 ***
x1 -4.30334 0.01189 -361.96 <2e-16 ***
x2 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9991 on 498 degrees of freedom
Multiple R-squared: 0.9962, Adjusted R-squared: 0.9962
F-statistic: 1.31e+05 on 1 and 498 DF, p-value: < 2.2e-16
drop1(lmFit)
Single term deletions
Model:
y ~ x1 + x2
Df Sum of Sq RSS AIC
<none> 497.08 1.0687
x1 0 0.0001417 497.08 1.0688
x2 0 0.0000000 497.08 1.0687
Linear model stops working.
Simulate Markov chains.
cbind(actual=c(4,.2,3),estimated=fitlm$coefficients)
actual estimated
(Intercept) 4.0 4.5619222
x[, 1] 0.2 -0.5468278
x[, 2] 3.0 2.5034438
Run the chains and analyze the results.
tStart <- proc.time()
fit3c <- sampling(RobustMultipleRegressionDso,
                  data = dataListShrink2c,
                  pars = c('beta0', 'beta', 'nu', 'sigma'),
                  iter = 5000, chains = 1, cores = 2)
SAMPLING FOR MODEL '8e2e8be6f09a2541ae94da0604a417ec' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 5000 [ 0%] (Warmup)
Chain 1: Iteration: 500 / 5000 [ 10%] (Warmup)
Chain 1: Iteration: 1000 / 5000 [ 20%] (Warmup)
Chain 1: Iteration: 1500 / 5000 [ 30%] (Warmup)
Chain 1: Iteration: 2000 / 5000 [ 40%] (Warmup)
Chain 1: Iteration: 2500 / 5000 [ 50%] (Warmup)
Chain 1: Iteration: 2501 / 5000 [ 50%] (Sampling)
Chain 1: Iteration: 3000 / 5000 [ 60%] (Sampling)
Chain 1: Iteration: 3500 / 5000 [ 70%] (Sampling)
Chain 1: Iteration: 4000 / 5000 [ 80%] (Sampling)
Chain 1: Iteration: 4500 / 5000 [ 90%] (Sampling)
Chain 1: Iteration: 5000 / 5000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 155.086 seconds (Warm-up)
Chain 1: 114.928 seconds (Sampling)
Chain 1: 270.014 seconds (Total)
Chain 1:
tEnd <- proc.time()
tEnd - tStart
user system elapsed
261.57 0.17 270.09
With collinear predictors the model definitely takes much longer to simulate.
stan_dens(fit3c)
stan_ac(fit3c, separate_chains = T)
summary(fit3c)$summary[,c(1,3,4,8,9)]
mean sd 2.5% 97.5% n_eff
beta0 5.2869784 4.12513863 -2.5410297 13.834655 258.6426
beta[1] -1.5985501 6.17402618 -14.3329219 10.098421 255.8424
beta[2] 1.8031011 4.11591117 -6.6807200 9.601809 255.7672
nu 57.5235574 34.84527395 15.7981673 148.866690 605.1602
sigma 0.9834476 0.03440985 0.9179121 1.052713 331.5993
lp__ 967.5976423 1.48319659 964.0943396 969.579397 845.7838
pairs(fit3c,pars=c("beta0","beta[1]","beta[2]"))
plot(fit3c,pars="nu")
plot(fit3c,pars="sigma")
plot(fit3c,pars="beta0")
plot(fit3c,pars="beta[1]")
plot(fit3c,pars="beta[2]")
Markov chains may exceed the limit on tree depth (yellow dots on the pairs plot).
But the Bayesian method still works: it shows that one or both of the slopes are not significantly different from zero.
When there are many candidate predictors in the model, it may be useful to "motivate" them to move closer to zero if they are not very strong.
One way to do this is to replace the normal prior on the slopes \(\beta_j\) with a Student-t prior centered at zero that has a small scale \(\sigma\) and a small normality parameter \(\nu\): small \(\sigma\) forces the slopes to shrink towards the zero mean, while small \(\nu\) makes the tails fat enough to allow some strong slopes to remain as outliers.
Parameter \(\sigma\) of the prior for regression coefficients \(\beta_j\) can either be fixed or be given its own prior and estimated.
In the former case all coefficients are forced to have the same regularizer; if \(\sigma\) is random and estimated from the same data, then there is mutual influence between \(\sigma\) and the regression coefficients: if many of them are close to zero then \(\sigma\) becomes smaller, which in turn pushes the coefficients even closer to zero.
These approaches remind us of ridge/LASSO/elastic net regression in machine learning models.
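To see what the scale of this zero-centered Student-t prior does, one can plot its density for two different scale values (the scales 1.0 and 0.2 are illustrative, not values used in the model; the tiny df of 1/30 matches the prior in the Stan code below):
# smaller scale concentrates the prior for standardized slopes near zero,
# while the very small df keeps the tails heavy enough for strong slopes
zb <- seq(-4, 4, by = 0.01)
plot(zb, dt(zb / 1.0, df = 1/30) / 1.0, type = "l", lwd = 2,
     xlab = "standardized slope", ylab = "prior density")
lines(zb, dt(zb / 0.2, df = 1/30) / 0.2, lwd = 2, lty = 2)
legend("topright", lty = 1:2, lwd = 2, legend = c("scale = 1.0", "scale = 0.2"))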
Use the same data dataListRegression
as in the above
section.
Describe the model.
\[y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i \\ y_i \sim t(\mu_i, \sigma, \nu) \\ \\ \text{where mean, scale and degree of freedom, respectively, would be} \\ \mu_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} \\ \beta_0 \sim N(M_0, S_0) \ ; \ \beta_j \sim t(\mu_j, \nu_j, \sigma_{\beta}) \\ \sigma_{\beta} \sim gamma(shape,rate) \\ \sigma \sim Unif(L, H) \\ \nu = \gamma^{\prime} + 1, \ \text{where} \ \gamma^{\prime} \sim exp(\lambda)\]
modelString <- "
data {
int<lower=1> Ntotal;
int<lower=1> Nx;
vector[Ntotal] y;
matrix[Ntotal, Nx] x;
}
transformed data {
real meanY;
real sdY;
vector[Ntotal] zy; // normalized
vector[Nx] meanX;
vector[Nx] sdX;
matrix[Ntotal, Nx] zx; // normalized
meanY = mean(y);
sdY = sd(y);
zy = (y - meanY) / sdY;
for (j in 1:Nx) {
meanX[j] = mean(x[,j]);
sdX[j] = sd(x[,j]);
for (i in 1:Ntotal) {
zx[i,j] = (x[i,j] - meanX[j]) / sdX[j];
}
}
}
parameters {
real zbeta0;
real<lower=0> sigmaBeta;
vector[Nx] zbeta;
real<lower=0> nu;
real<lower=0> zsigma;
}
transformed parameters{
vector[Ntotal] zy_hat;
zy_hat = zbeta0 + zx * zbeta;
}
model {
zbeta0 ~ normal(0, 2);
sigmaBeta ~ gamma(2.3,1.3); // mode=(alpha-1)/beta, var=alpha/beta^2
zbeta ~ student_t(1.0/30.0, 0, sigmaBeta);
nu ~ exponential(1/30.0);
zsigma ~ uniform(1.0E-5 , 1.0E+1);
zy ~ student_t(1+nu, zy_hat, zsigma);
}
generated quantities {
// Transform to original scale:
real beta0;
vector[Nx] beta;
real sigma;
// .* and ./ are element-wise product and divide
beta0 = zbeta0*sdY + meanY - sdY * sum( zbeta .* meanX ./ sdX );
beta = sdY * ( zbeta ./ sdX );
sigma = zsigma * sdY;
} "
The gamma prior for sigmaBeta is selected to have a relatively low mode of 1.
xGamma <- seq(from = .00001, to = 10, by = .001)
plot(xGamma, dgamma(xGamma, shape = 2.3, rate = 1.3), type = "l")
xGamma[which.max(dgamma(xGamma, shape = 2.3, rate = 1.3))]
[1] 1.00001
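This matches the closed-form mode of a gamma distribution with shape > 1, \((\text{shape}-1)/\text{rate}\):
# analytic mode of gamma(shape = 2.3, rate = 1.3)
(2.3 - 1) / 1.3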
Create DSO.
RegressionShrinkDso <- stan_model(model_code = modelString)
If a saved DSO is available, load it instead of recompiling, then run the chains.
# save(RegressionShrinkDso, file = "data/DSOShrunkMultRegr.Rds")
load("data/DSOShrunkMultRegr.Rds")
Generate Markov chains in case of 2 significant predictors.
tStart <- proc.time()
# fit model
fit4 <- sampling(RegressionShrinkDso,
                 data = dataListRegression,
                 pars = c('beta0', 'beta', 'nu', 'sigma', 'sigmaBeta'),
                 iter = 5000, chains = 2, cores = 2)
tEnd <- proc.time()
tEnd - tStart
user system elapsed
0.40 0.25 18.60
Analyze fitted model using shinystan
launch_shinystan(fit4)
stan_dens(fit4)
stan_ac(fit4, separate_chains = T)
summary(fit4)$summary[,c(1,3,4,8,9)]
mean sd 2.5% 97.5% n_eff
beta0 4.5843612 0.228604011 4.1393299 5.027146 8748.136
beta[1] 1.0735281 0.010973133 1.0524275 1.095341 8200.715
beta[2] 2.9950789 0.007839406 2.9796239 3.009790 7482.511
nu 61.3810270 35.734724611 17.0630928 149.281989 5570.822
sigma 0.9613439 0.033176999 0.8953832 1.028195 5375.540
sigmaBeta 1.3993861 0.899552223 0.2216032 3.632021 6406.796
lp__ 1010.3909262 1.749705508 1006.2179766 1012.779996 2263.114
pairs(fit4,pars=c("beta0","beta[1]","beta[2]"))
plot(fit4,pars="nu")
plot(fit4,pars="sigma")
plot(fit4,pars="beta0")
plot(fit4,pars="beta[1]")
plot(fit4,pars="beta[2]")
Compare posterior mean values and 95% HDI with fit1
(same model, but with no shrinkage).
cbind(summary(fit1)$summary[1:3,c(1,4,8)],
summary(fit4)$summary[1:3,c(1,4,8)])
mean 2.5% 97.5% mean 2.5% 97.5%
beta0 4.578031 4.130417 5.025614 4.584361 4.139330 5.027146
beta[1] 1.073828 1.052198 1.095054 1.073528 1.052427 1.095341
beta[2] 2.995153 2.979745 3.011046 2.995079 2.979624 3.009790
Mean values of both fits seem very similar.
Check widths of the HDI for coefficients.
cbind(summary(fit1)$summary[1:3,c(8)]-summary(fit1)$summary[1:3,c(4)],
summary(fit4)$summary[1:3,c(8)]-summary(fit4)$summary[1:3,c(4)])
[,1] [,2]
beta0 0.89519719 0.88781592
beta[1] 0.04285560 0.04291323
beta[2] 0.03130079 0.03016606
Shrinkage can only be noticed after the third digit of the coefficients.
In this example both slopes are significant and they practically did not shrink.
For comparison, fit a linear model, ridge regression and lasso regression to the same data.
1- Linear model.
lmFit <- lm(dataListRegression$y ~ dataListRegression$x[,1] + dataListRegression$x[,2])
2- Ridge.
library(glmnet)
set.seed(15)
cv.outRidge <- cv.glmnet(x = dataListRegression$x,
                         y = dataListRegression$y,
                         alpha = 0)
plot(cv.outRidge)
(bestlam <- cv.outRidge$lambda.min)
[1] 1.680471
ridgeFit <- glmnet(x = dataListRegression$x, y = dataListRegression$y,
                   alpha = 0, lambda = bestlam, standardize = F)
(ridge.coef <- predict(ridgeFit, type = "coefficients", s = bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 4.768742
V1 1.068697
V2 2.986144
3- Lasso.
set.seed(15)
cv.outLasso <- cv.glmnet(x = dataListRegression$x,
                         y = dataListRegression$y,
                         alpha = 1)
plot(cv.outLasso)
(bestlam <- cv.outLasso$lambda.min)
[1] 0.08363741
lassoFit <- glmnet(x = dataListRegression$x, y = dataListRegression$y,
                   alpha = 1, lambda = bestlam, standardize = F)
(lasso.coef <- predict(lassoFit, type = "coefficients", s = bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 4.689546
V1 1.069233
V2 2.992895
Compare coefficients from all 5 models.
comparison <- cbind(summary(fit1)$summary[1:3,c(1,4,8)],
                    summary(fit4)$summary[1:3,c(1,4,8)],
                    Ridge = ridge.coef,
                    Lasso = lasso.coef,
                    Linear = lmFit$coefficients)
colnames(comparison)<-c(paste("NoShrinkage",c("mean","2.5%","97.5%"),sep="_"),
paste("Shrinkage",c("mean","2.5%","97.5%"),sep="_"),
"Ridge","Lasso","Linear")
t(comparison)
9 x 3 sparse Matrix of class "dgCMatrix"
beta0 beta[1] beta[2]
NoShrinkage_mean 4.578031 1.073828 2.995153
NoShrinkage_2.5% 4.130417 1.052198 2.979745
NoShrinkage_97.5% 5.025614 1.095054 3.011046
Shrinkage_mean 4.584361 1.073528 2.995079
Shrinkage_2.5% 4.139330 1.052427 2.979624
Shrinkage_97.5% 5.027146 1.095341 3.009790
Ridge 4.768742 1.068697 2.986144
Lasso 4.689546 1.069233 2.992895
Linear 4.565419 1.074206 2.995385
All models show practically no shrinkage relative to the linear model.
Both ridge and lasso regression estimate the intercept somewhat too high.
Shrink estimates from the data dataListInsig.
tStart <- proc.time()
# fit model
fit5 <- sampling(RegressionShrinkDso,
                 data = dataListInsig,
                 pars = c('beta0', 'beta', 'nu', 'sigma', 'sigmaBeta'),
                 iter = 5000, chains = 2, cores = 2)
tEnd <- proc.time()
tEnd - tStart
user system elapsed
0.44 0.17 18.02
We could analyze the fitted model with shinystan, but let's check the densities, autocorrelations and summaries directly.
stan_dens(fit5)
stan_ac(fit5, separate_chains = T)
summary(fit5)$summary[,c(1,3,4,8,9)]
mean sd 2.5% 97.5%
beta0 1.221117e+00 0.05300762 1.11769338 1.32774329
beta[1] 7.988002e-01 0.02885518 0.74227587 0.85594136
beta[2] 7.983679e-03 0.02713646 -0.04570233 0.06256174
nu 5.257628e+01 33.19871103 13.71139258 137.97027648
sigma 5.977651e-01 0.02104643 0.55653680 0.63914284
sigmaBeta 1.043781e+00 0.83037712 0.10282820 3.17391417
lp__ -1.850178e+02 1.72628557 -189.19759519 -182.70019632
n_eff
beta0 6549.983
beta[1] 6984.701
beta[2] 6777.038
nu 4536.272
sigma 5500.091
sigmaBeta 5994.464
lp__ 2157.592
pairs(fit5,pars=c("beta0","beta[1]","beta[2]"))
plot(fit5,pars="nu")
plot(fit5,pars="sigma")
plot(fit5,pars="beta0")
plot(fit5,pars="beta[1]")
plot(fit5,pars="beta[2]")
This time posterior density of \(\beta_2\) (beta[2]) is concentrated at zero.
Compare mean levels and HDI widths for fits with and without shrinkage.
cbind(summary(fit2)$summary[1:3,c(1,4,8)],
summary(fit5)$summary[1:3,c(1,4,8)])
mean 2.5% 97.5% mean 2.5%
beta0 1.217597421 1.11612951 1.31636194 1.221117106 1.11769338
beta[1] 0.799972585 0.74453626 0.85437808 0.798800184 0.74227587
beta[2] 0.009416795 -0.04397004 0.06323134 0.007983679 -0.04570233
97.5%
beta0 1.32774329
beta[1] 0.85594136
beta[2] 0.06256174
cbind(summary(fit2)$summary[1:3,c(8)]-summary(fit2)$summary[1:3,c(4)],
summary(fit5)$summary[1:3,c(8)]-summary(fit5)$summary[1:3,c(4)])
[,1] [,2]
beta0 0.2002324 0.2100499
beta[1] 0.1098418 0.1136655
beta[2] 0.1072014 0.1082641
Parameters shrunk a little more this time; the second coefficient shrunk to zero.
Again, fit linear model, ridge and lasso regressions to the same data.
1- Linear model.
lmFit <- lm(dataListInsig$y ~ dataListInsig$x[,1] + dataListInsig$x[,2])
2- Ridge.
set.seed(15)
cv.outRidge <- cv.glmnet(x = dataListInsig$x, y = dataListInsig$y, alpha = 0)
plot(cv.outRidge)
(bestlam <- cv.outRidge$lambda.min)
[1] 0.07783229
ridgeFit <- glmnet(x = dataListInsig$x, y = dataListInsig$y,
                   alpha = 0, lambda = bestlam, standardize = F)
(ridge.coef <- predict(ridgeFit, type = "coefficients", s = bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 1.287098608
Input1 0.739131424
Input2 0.004155875
3- Lasso.
set.seed(15)
cv.outLasso <- cv.glmnet(x = dataListInsig$x, y = dataListInsig$y, alpha = 1)
plot(cv.outLasso)
(bestlam <- cv.outLasso$lambda.min)
[1] 0.02067294
lassoFit <- glmnet(x = dataListInsig$x, y = dataListInsig$y,
                   alpha = 1, lambda = bestlam, standardize = F)
(lasso.coef <- predict(lassoFit, type = "coefficients", s = bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 1.249387
Input1 0.778466
Input2 .
Compare coefficients from all 5 models.
comparison <- cbind(summary(fit2)$summary[1:3,c(1)],
                    summary(fit5)$summary[1:3,c(1)],
                    Ridge = ridge.coef,
                    Lasso = lasso.coef,
                    Linear = lmFit$coefficients)
colnames(comparison)<-c("NoShrinkage","Shrinkage","Ridge","Lasso","Linear")
t(comparison)
5 x 3 sparse Matrix of class "dgCMatrix"
(Intercept) Input1 Input2
NoShrinkage 1.217597 0.7999726 0.009416795
Shrinkage 1.221117 0.7988002 0.007983679
Ridge 1.287099 0.7391314 0.004155875
Lasso 1.249387 0.7784660 .
Linear 1.214802 0.8011573 0.009700042
All models correctly exclude the second coefficient.
Ridge shrunk both slopes more than the other models.
There is again a tendency for ridge and lasso to overestimate the intercept.
Shrink coefficients estimated from dataListShrink2.
tStart <- proc.time()
# fit model
fit6 <- sampling(RegressionShrinkDso,
                 data = dataListShrink2,
                 pars = c('beta0', 'beta', 'nu', 'sigma', 'sigmaBeta'),
                 iter = 5000, chains = 2, cores = 2)
tEnd <- proc.time()
tEnd - tStart
user system elapsed
0.54 0.10 91.97
We could analyze model with shinystan
but let’s check
densities, pairs and individual plots of parameters.
stan_dens(fit6)
stan_ac(fit6, separate_chains = T)
summary(fit6)$summary[,c(1,3,4,8,9)]
mean sd 2.5% 97.5% n_eff
beta0 4.5437360 0.47198344 3.6607788 5.5095532 1845.144
beta[1] -0.5261239 0.63908279 -1.8992740 0.6344545 1575.212
beta[2] 2.5172117 0.42658592 1.6035586 3.2916493 1575.367
nu 56.5744375 34.07819768 14.7590894 143.3833364 3428.014
sigma 0.9804926 0.03427887 0.9135965 1.0470397 3347.567
sigmaBeta 1.2388283 0.85960126 0.1607706 3.3968145 3997.294
lp__ 963.8218087 1.75910037 959.4784596 966.1987987 2037.995
pairs(fit6,pars=c("beta0","beta[1]","beta[2]"))
plot(fit6,pars="nu")
plot(fit6,pars="sigma")
plot(fit6,pars="beta0")
plot(fit6,pars="beta[1]")
plot(fit6,pars="beta[2]")
Show mean values and HDI.
cbind(summary(fit3)$summary[1:3,c(1,4,8)],
summary(fit6)$summary[1:3,c(1,4,8)])
mean 2.5% 97.5% mean 2.5% 97.5%
beta0 4.5857595 3.631423 5.5544965 4.5437360 3.660779 5.5095532
beta[1] -0.5967437 -1.931495 0.7388229 -0.5261239 -1.899274 0.6344545
beta[2] 2.4699613 1.576532 3.3619226 2.5172117 1.603559 3.2916493
cbind(summary(fit3)$summary[1:3,c(8)]-summary(fit3)$summary[1:3,c(4)],
summary(fit6)$summary[1:3,c(8)]-summary(fit6)$summary[1:3,c(4)])
[,1] [,2]
beta0 1.923073 1.848774
beta[1] 2.670318 2.533728
beta[2] 1.785391 1.688091
In this example \(\beta_1\) shrunk more noticeably and is not different from zero.
At the same time \(\beta_2\) has become more different from zero.
Regularization reinforced one of the two correlated predictors while damping the other.
Again, fit linear model, ridge and lasso regressions to the same data.
1- Linear model.
lmFit <- lm(dataListShrink2$y ~ dataListShrink2$x[,1] + dataListShrink2$x[,2])
summary(lmFit)
Call:
lm(formula = dataListShrink2$y ~ dataListShrink2$x[, 1] + dataListShrink2$x[,
2])
Residuals:
Min 1Q Median 3Q Max
-2.7765 -0.7585 0.0172 0.6577 3.1721
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.5619 0.4878 9.353 < 2e-16 ***
dataListShrink2$x[, 1] -0.5468 0.6707 -0.815 0.415
dataListShrink2$x[, 2] 2.5034 0.4479 5.589 3.76e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9988 on 497 degrees of freedom
Multiple R-squared: 0.9962, Adjusted R-squared: 0.9962
F-statistic: 6.53e+04 on 2 and 497 DF, p-value: < 2.2e-16
2- Ridge.
set.seed(15)
cv.outRidge <- cv.glmnet(x = dataListShrink2$x, y = dataListShrink2$y, alpha = 0)
plot(cv.outRidge)
(bestlam <- cv.outRidge$lambda.min)
[1] 1.61435
ridgeFit <- glmnet(x = dataListShrink2$x, y = dataListShrink2$y,
                   alpha = 0, lambda = bestlam, standardize = F)
(ridge.coef <- predict(ridgeFit, type = "coefficients", s = bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 4.943942
x1 -1.425696
x2 1.910632
3- Lasso.
set.seed(15)
cv.outLasso <- cv.glmnet(x = dataListShrink2$x, y = dataListShrink2$y, alpha = 1)
plot(cv.outLasso)
(bestlam <- cv.outLasso$lambda.min)
[1] 0.1062134
lassoFit <- glmnet(x = dataListShrink2$x, y = dataListShrink2$y,
                   alpha = 1, lambda = bestlam, standardize = F)
(lasso.coef <- predict(lassoFit, type = "coefficients", s = bestlam))
3 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 4.116010
x1 .
x2 2.865189
Compare coefficients from all 5 models.
comparison <- cbind(summary(fit3)$summary[1:3,c(1)],
                    summary(fit6)$summary[1:3,c(1)],
                    Ridge = ridge.coef,
                    Lasso = lasso.coef,
                    Linear = lmFit$coefficients)
colnames(comparison)<-c("NoShrinkage","Shrinkage","Ridge","Lasso","Linear")
t(comparison)
5 x 3 sparse Matrix of class "dgCMatrix"
(Intercept) x1 x2
NoShrinkage 4.585760 -0.5967437 2.469961
Shrinkage 4.543736 -0.5261239 2.517212
Ridge 4.943942 -1.4256958 1.910632
Lasso 4.116010 . 2.865189
Linear 4.561922 -0.5468278 2.503444
All models correctly exclude the first slope.
Lasso does it decisively, making slope \(\beta_1\) exactly equal to \(0\).
Lasso also estimated the intercept and \(\beta_2\) more accurately than the other models: recall that for this data set we simulated \(\beta_0=4, \beta_2=3\).
Analysis of SAT scores, example from Kruschke, 2015, section 18.3.
These data are analyzed in the article by Deborah Lynn Guber.
The variables observed for each state include spending per student (Spend), student/teacher ratio (StuTeaRat), teacher salary (Salary), percentage of students taking the SAT (PrcntTake), and SAT verbal, math and total scores (SATV, SATM, SATT).
Read the data from the file Guber1999data.csv available at Kruschke, 2015.
myData = read.csv("data/Guber1999data.csv") # section 18.3 @ Kruschke
head(myData)
State Spend StuTeaRat Salary PrcntTake SATV SATM SATT
1 Alabama 4.405 17.2 31.144 8 491 538 1029
2 Alaska 8.963 17.6 47.951 47 445 489 934
3 Arizona 4.778 19.3 32.175 27 448 496 944
4 Arkansas 4.459 17.1 28.934 6 482 523 1005
5 California 4.992 24.0 41.078 45 417 485 902
6 Colorado 5.443 18.4 34.571 29 462 518 980
pairs(myData[,-c(1,6:7)])
plot(myData$Spend,myData$SATT)
summary(lm(myData$SATT~myData$Spend))$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1089.29372 44.389950 24.539197 8.168276e-29
myData$Spend -20.89217 7.328209 -2.850925 6.407965e-03
The plots show that the mean SAT score is negatively correlated with the amount of money states spend per student. These results were used in heated debates about education spending to support the argument in favor of reducing public support for schools.
Prepare the data.
Use the 2 predictors from the file, plus add 12 randomly generated nuisance predictors.
Ntotal <- nrow(myData)
y <- myData$SATT
x <- cbind(myData$Spend, myData$PrcntTake)
colnames(x) <- c("Spend","PrcntTake")
dataList2Predict <- list(Ntotal = Ntotal, y = y, x = x, Nx = ncol(x))
# generate 12 spurious predictors:
set.seed(47405)
NxRand <- 12
for (xIdx in 1:NxRand) {
  xRand = rnorm(Ntotal)
  x = cbind(x, xRand)
  colnames(x)[ncol(x)] = paste0("xRand", xIdx)
}
dataListExtraPredict <- list(Ntotal = Ntotal, y = y, x = x, Nx = ncol(x))
Use the same model as in the example of the first section:
RobustMultipleRegressionDso
.
First, run the model with 2 predictors.
fit_noshrink2Pred <- sampling(RobustMultipleRegressionDso,
                              data = dataList2Predict,
                              pars = c('beta0', 'beta', 'nu', 'sigma'),
                              iter = 5000, chains = 2, cores = 2)
summary(fit_noshrink2Pred)$summary[,c(1,4,8)]
mean 2.5% 97.5%
beta0 991.4007496 947.253200 1036.31532
beta[1] 12.8196529 4.235574 21.51116
beta[2] -2.8747599 -3.308695 -2.44884
nu 33.4879111 3.455822 114.56115
sigma 31.6173398 24.425564 39.83589
lp__ -0.1425401 -4.083100 1.96456
It is clear that the slope of Spend is significantly positive and the slope of PrcntTake is significantly negative.
This shows that the negative correlation between SAT scores and the money spent, as seen in the scatterplot, is illusory: fewer students from underfunded schools take the SAT, and these are only the students who apply for colleges; students who would potentially receive low SAT scores do not apply to college and do not take the test.
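The sign flip can also be seen directly by comparing the marginal and partial least-squares coefficients of Spend (a quick check; both regressions appear elsewhere in this post):
# marginal effect of Spend is negative, but its partial effect,
# controlling for the percentage of students taking the test, is positive
coef(lm(SATT ~ Spend, data = myData))["Spend"]
coef(lm(SATT ~ Spend + PrcntTake, data = myData))["Spend"]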
Run MCMC for the model with additional nuisance predictors.
fit_noshrinkExtra <- sampling(RobustMultipleRegressionDso,
                              data = dataListExtraPredict,
                              pars = c('beta0', 'beta', 'nu', 'sigma'),
                              iter = 5000, chains = 2, cores = 2)
Analyze the output with shinystan
.
launch_shinystan(fit_noshrinkExtra)
Here are the results of MCMC.
stan_ac(fit_noshrinkExtra, separate_chains = T)
pairs(fit_noshrinkExtra,pars=c("beta0","beta[1]","beta[2]"))
plot(fit_noshrinkExtra,pars=c('beta'))
stan_dens(fit_noshrinkExtra,pars=c("beta0","beta"))
All densities look symmetrical: mean values of posterior distributions can be used as point estimates of betas.
summary(fit_noshrinkExtra)$summary[,c(1,4,8)]
mean 2.5% 97.5%
beta0 998.618086 952.442828 1044.8766561
beta[1] 10.234291 1.233829 19.2611967
beta[2] -2.721493 -3.179022 -2.2554795
beta[3] 2.304234 -11.450917 15.7354294
beta[4] -4.728814 -14.889133 5.4752469
beta[5] 6.250301 -7.252438 19.1997960
beta[6] -5.696532 -15.225194 3.7955166
beta[7] 6.989249 -2.925123 16.8434431
beta[8] 1.908281 -8.072728 12.2999566
beta[9] 4.283368 -6.011922 14.5423902
beta[10] 2.651897 -11.337831 17.1438898
beta[11] -4.221621 -13.803873 5.1213425
beta[12] -10.363432 -20.155584 -0.7626764
beta[13] -1.894524 -12.602323 8.8656771
beta[14] 1.710674 -8.973928 12.1135095
nu 31.187292 2.027864 105.5120668
sigma 30.255439 21.367424 39.5905702
lp__ 1.240241 -6.398344 6.6903196
The variables corresponding to betas are:
colnames(x)
[1] "Spend" "PrcntTake" "xRand1" "xRand2" "xRand3"
[6] "xRand4" "xRand5" "xRand6" "xRand7" "xRand8"
[11] "xRand9" "xRand10" "xRand11" "xRand12"
Note that the coefficient of the variable Spend is still positive, but the left end of its HDI is much closer to zero. The coefficient of PrcntTake is still significantly negative.
One of the nuisance predictors, beta[12], happened to be significantly negative.
As a result of adding nuisance predictors, the accuracy of inference becomes lower.
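One way to quantify this (a sketch based on the summaries above) is to compare the width of the 95% interval for beta[1] (Spend) before and after adding the nuisance predictors:
# 95% interval width for the Spend slope: 2 predictors vs 2 + 12 nuisance predictors
summary(fit_noshrink2Pred)$summary["beta[1]", "97.5%"] -
  summary(fit_noshrink2Pred)$summary["beta[1]", "2.5%"]
summary(fit_noshrinkExtra)$summary["beta[1]", "97.5%"] -
  summary(fit_noshrinkExtra)$summary["beta[1]", "2.5%"]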
Analyze the same data with the model encouraging shrinkage of parameters.
First, fit the model without nuisance parameters.
fit_shrink <- sampling(RegressionShrinkDso,
                       data = dataList2Predict,
                       pars = c('beta0', 'beta', 'nu', 'sigma', 'sigmaBeta'),
                       iter = 5000, chains = 2, cores = 2)
Check convergence in shinystan.
launch_shinystan(fit_shrink)
pairs(fit_shrink,pars=c("beta0","beta","nu","sigma","sigmaBeta"))
plot(fit_shrink,pars=c('beta'))
stan_dens(fit_shrink,pars=c('beta'))
stan_ac(fit_shrink, separate_chains = T)
Compare with the fit without nuisance parameters and without shrinkage.
cbind(summary(fit_noshrink2Pred)$summary[1:4,c(1,4,8)],
summary(fit_shrink)$summary[1:4,c(1,4,8)])
mean 2.5% 97.5% mean 2.5%
beta0 991.40075 947.253200 1036.31532 996.071254 951.012629
beta[1] 12.81965 4.235574 21.51116 11.770134 2.632324
beta[2] -2.87476 -3.308695 -2.44884 -2.831416 -3.275607
nu 33.48791 3.455822 114.56115 33.323074 3.756878
97.5%
beta0 1042.051358
beta[1] 20.514489
beta[2] -2.361973
nu 109.178594
The first variable shrunk closer to zero: the mean value is smaller and the left end of the 95%-HDI is closer to zero.
Now fit the model with additional parameters.
fit_shrinkExtra <- sampling(RegressionShrinkDso,
                            data = dataListExtraPredict,
                            pars = c('beta0', 'beta', 'nu', 'sigma', 'sigmaBeta'),
                            iter = 5000, chains = 2, cores = 2)
stan_ac(fit_shrinkExtra, separate_chains = T)
pairs(fit_shrinkExtra,pars=c("beta0","beta[1]","beta[2]","beta[3]","beta[4]","beta[11]","beta[12]"))
pairs(fit_shrinkExtra,pars=c("nu","sigma","sigmaBeta"))
plot(fit_shrinkExtra,pars=c('beta'))
stan_dens(fit_shrinkExtra,pars=c('beta'))
Note characteristic pinched tips of posterior densities for shrunk variables.
summary(fit_shrinkExtra)$summary[,c(1:4,8)]
mean se_mean sd 2.5%
beta0 1009.82602144 4.473582390 26.5068325 9.628679e+02
beta[1] 8.21954511 1.078486389 5.2814904 -2.802567e-01
beta[2] -2.68755852 0.033018532 0.2279845 -3.164852e+00
beta[3] 1.16429414 0.538365949 3.1397865 -4.187581e+00
beta[4] -1.16002937 0.229745431 2.6666174 -8.523255e+00
beta[5] 1.78068242 0.492660992 3.0628568 -2.483201e+00
beta[6] -2.47265483 1.152596733 3.7247172 -1.093501e+01
beta[7] 2.76863569 1.135585515 3.9366726 -1.659493e+00
beta[8] 0.86387877 0.267798818 2.3636293 -2.570624e+00
beta[9] 1.36957487 0.314802537 2.8913159 -2.288281e+00
beta[10] -1.39643294 1.838822871 3.7474360 -1.012010e+01
beta[11] -1.00165428 0.236339260 2.3005006 -7.569141e+00
beta[12] -6.61450384 0.710490722 5.4495075 -1.794561e+01
beta[13] -0.38037233 0.100166767 1.9872390 -5.869640e+00
beta[14] 0.76980004 0.466670211 2.3077738 -3.803671e+00
nu 40.07666808 5.561396626 34.6493271 3.403896e+00
sigma 30.81178901 0.343535668 3.8888139 2.332439e+01
sigmaBeta 0.01966483 0.002562242 0.0270782 5.482108e-04
lp__ 21.13792207 0.824755562 6.4649353 9.719103e+00
97.5%
beta0 1058.28908450
beta[1] 18.23938239
beta[2] -2.24414288
beta[3] 8.71009792
beta[4] 3.08755229
beta[5] 9.02094911
beta[6] 1.45812939
beta[7] 11.60701517
beta[8] 7.02230696
beta[9] 9.15217808
beta[10] 5.87438345
beta[11] 2.47309092
beta[12] 0.61143385
beta[13] 3.18343802
beta[14] 6.30005198
nu 129.97213782
sigma 38.72621370
sigmaBeta 0.09534239
lp__ 35.39122557
Parameter beta[12] has shrunk to zero based on the 95%-HDI as a result of the regularized model.
This helped remove all nuisance parameters. But shrinkage also removed parameter beta[1] of the variable Spend!
Compare with linear model.
Without nuisance predictors:
lmSAT <- lm(y ~ x[,1] + x[,2])
summary(lmSAT)
Call:
lm(formula = y ~ x[, 1] + x[, 2])
Residuals:
Min 1Q Median 3Q Max
-88.400 -22.884 1.968 19.142 68.755
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 993.8317 21.8332 45.519 < 2e-16 ***
x[, 1] 12.2865 4.2243 2.909 0.00553 **
x[, 2] -2.8509 0.2151 -13.253 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 32.46 on 47 degrees of freedom
Multiple R-squared: 0.8195, Adjusted R-squared: 0.8118
F-statistic: 106.7 on 2 and 47 DF, p-value: < 2.2e-16
confint(lmSAT)
2.5 % 97.5 %
(Intercept) 949.908859 1037.754459
x[, 1] 3.788291 20.784746
x[, 2] -3.283679 -2.418179
With nuisance predictors:
lmSATAll <- lm(y ~ ., data = as.data.frame(cbind(y, x)))
summary(lmSATAll)
Call:
lm(formula = y ~ ., data = as.data.frame(cbind(y, x)))
Residuals:
Min 1Q Median 3Q Max
-61.485 -17.643 1.093 15.349 64.549
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1002.601 22.231 45.100 < 2e-16 ***
Spend 9.503 4.313 2.203 0.0342 *
PrcntTake -2.703 0.228 -11.853 8.29e-14 ***
xRand1 2.164 6.842 0.316 0.7536
xRand2 -5.190 5.015 -1.035 0.3078
xRand3 6.424 6.599 0.974 0.3370
xRand4 -5.678 4.899 -1.159 0.2542
xRand5 7.363 4.957 1.485 0.1464
xRand6 1.606 5.074 0.316 0.7536
xRand7 3.909 5.034 0.777 0.4426
xRand8 3.060 7.072 0.433 0.6679
xRand9 -4.654 4.397 -1.058 0.2971
xRand10 -10.265 4.816 -2.131 0.0401 *
xRand11 -2.912 5.252 -0.555 0.5827
xRand12 1.334 5.258 0.254 0.8013
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 31.51 on 35 degrees of freedom
Multiple R-squared: 0.8733, Adjusted R-squared: 0.8226
F-statistic: 17.23 on 14 and 35 DF, p-value: 1.083e-11
confint(lmSATAll)[2:3,2]-confint(lmSATAll)[2:3,1]
Spend PrcntTake
17.5101430 0.9258323
confint(lmSAT)[2:3,2]-confint(lmSAT)[2:3,1]
x[, 1] x[, 2]
16.9964551 0.8655003
These also show that the addition of nuisance parameters widened the confidence intervals. Now fit ridge and lasso regressions to the data with nuisance predictors and compare all models.
set.seed(15)
cv.outRidge <- cv.glmnet(x = dataListExtraPredict$x, y = dataListExtraPredict$y, alpha = 0)
plot(cv.outRidge)
(bestlam <- cv.outRidge$lambda.min)
[1] 6.570761
ridgeFit <- glmnet(x = dataListExtraPredict$x, y = dataListExtraPredict$y,
                   alpha = 0, lambda = bestlam, standardize = F)
ridge.coef <- predict(ridgeFit, type = "coefficients", s = bestlam)
set.seed(15)
cv.outLasso <- cv.glmnet(x = dataListExtraPredict$x, y = dataListExtraPredict$y, alpha = 1)
plot(cv.outLasso)
(bestlam <- cv.outLasso$lambda.min)
[1] 7.732551
lassoFit <- glmnet(x = dataListExtraPredict$x, y = dataListExtraPredict$y,
                   alpha = 1, lambda = bestlam, standardize = F)
lasso.coef <- predict(lassoFit, type = "coefficients", s = bestlam)
comparison <- round(cbind(summary(lmSATAll)$coefficients[,c(1,4)],
                          summary(fit_noshrinkExtra)$summary[1:15,c(1,4,8)],
                          summary(fit_shrinkExtra)$summary[1:15,c(1,4,8)],
                          ridge.coef, lasso.coef), 3)
comparison <- as.matrix(comparison)
colnames(comparison) <- c("LM","LM-Pv","NoShrink","NoShrink-L","NoShrink-H",
                          "Shrink","Shrink-L","Shrink-H","Ridge","Lasso")
comparison
LM LM-Pv NoShrink NoShrink-L NoShrink-H Shrink
(Intercept) 1002.601 0.000 998.618 952.443 1044.877 1009.826
Spend 9.503 0.034 10.234 1.234 19.261 8.220
PrcntTake -2.703 0.000 -2.721 -3.179 -2.255 -2.688
xRand1 2.164 0.754 2.304 -11.451 15.735 1.164
xRand2 -5.190 0.308 -4.729 -14.889 5.475 -1.160
xRand3 6.424 0.337 6.250 -7.252 19.200 1.781
xRand4 -5.678 0.254 -5.697 -15.225 3.796 -2.473
xRand5 7.363 0.146 6.989 -2.925 16.843 2.769
xRand6 1.606 0.754 1.908 -8.073 12.300 0.864
xRand7 3.909 0.443 4.283 -6.012 14.542 1.370
xRand8 3.060 0.668 2.652 -11.338 17.144 -1.396
xRand9 -4.654 0.297 -4.222 -13.804 5.121 -1.002
xRand10 -10.265 0.040 -10.363 -20.156 -0.763 -6.615
xRand11 -2.912 0.583 -1.895 -12.602 8.866 -0.380
xRand12 1.334 0.801 1.711 -8.974 12.114 0.770
Shrink-L Shrink-H Ridge Lasso
(Intercept) 962.868 1058.289 1005.862 1027.326
Spend -0.280 18.239 8.932 5.032
PrcntTake -3.165 -2.244 -2.697 -2.607
xRand1 -4.188 8.710 2.061 0.000
xRand2 -8.523 3.088 -5.024 0.000
xRand3 -2.483 9.021 5.004 0.000
xRand4 -10.935 1.458 -5.064 0.000
xRand5 -1.659 11.607 6.686 0.000
xRand6 -2.571 7.022 1.916 0.000
xRand7 -2.288 9.152 3.828 0.000
xRand8 -10.120 5.874 2.090 0.000
xRand9 -7.569 2.473 -4.497 0.000
xRand10 -17.946 0.611 -9.411 -4.142
xRand11 -5.870 3.183 -2.396 0.000
xRand12 -3.804 6.300 0.807 0.000
Note that there is no way to extract from ridge and lasso regressions any measure for comparison with zero, like confidence intervals.
- The linear model keeps both Spend and PrcntTake and removes at the 5% level all nuisance coefficients except xRand10. The Bayesian model without shrinkage does the same.
- The Bayesian model with shrinkage shrinks to zero all artificial predictors, but it also removes Spend.
- Ridge is in general consistent with the linear model, but it is not clear whether it shrinks any parameters to zero or not. Lasso shrinks all artificial predictors to zero except xRand10.